add fanout to tpu-client-next #3478
Conversation
Force-pushed from 107384f to f62dee2.
// Regardless of who is leader, add future leaders to the cache to
// hide the latency of opening the connection.
tokio::select! {
    send_res = workers.send_transactions_to_address(new_leader, transaction_batch.clone()) => match send_res {
Right now this will block if the corresponding worker's channel is full, so
effectively the slowest leader will slow down all other leaders.
I think we should 1) increase the channel size and 2) add a
try_send_transactions_to_address that returns an error if the channel is full.
If the channel is full, I guess as a start we could log a warning and drop the
batch for that leader, but at least we don't slow down the other leaders.
Then longer term, we need to think of a better API so that the caller of the
crate can decide what should happen: drop the batch? increase the channel size?
slow down upstream?
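A minimal sketch of the try_send idea proposed here, assuming stand-in types for the crate's TransactionBatch and WorkersCacheError (the real signature in the PR may differ):

use tokio::sync::mpsc::{self, error::TrySendError};

// Illustrative stand-ins for the crate's real types.
pub struct TransactionBatch(pub Vec<Vec<u8>>);

#[derive(Debug)]
pub enum WorkersCacheError {
    FullChannel,
    ReceiverDropped,
}

// Fail fast when the worker's channel is full instead of awaiting, so one
// slow leader cannot stall sends to the other leaders.
pub fn try_send_transactions_to_address(
    sender: &mpsc::Sender<TransactionBatch>,
    batch: TransactionBatch,
) -> Result<(), WorkersCacheError> {
    sender.try_send(batch).map_err(|err| match err {
        TrySendError::Full(_) => WorkersCacheError::FullChannel,
        TrySendError::Closed(_) => WorkersCacheError::ReceiverDropped,
    })
}

On FullChannel, the caller could log a warning and drop the batch for that leader, as suggested above.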
This makes sense. I'm not sure there is value in having both send_transactions and try_send_transactions in the context of this crate, so probably just modify the code to use try_. This also allows reducing the select!.
The size of the channel, worker_channel_size, is configurable through the Scheduler config. The only question is what value should be used in practice, for STS for example.
One thing that worries me is the transaction-bench scenario. There we actually need backpressure (which these worker_channel.send().await calls provide), because it tries to generate as many transaction batches as possible and we rely on slowing down the sender. So I want to add a configuration flag controlling whether try_send or send is used.
In the case of STS, try_send is fine because even if we drop some transactions silently, there is a retry thread which sends them again anyway.
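A hedged sketch of the configuration switch floated here (names are illustrative only; per the later commits the merged change ended up simply using try_send):

/// Illustrative knob: what to do when a worker's channel is full.
#[derive(Clone, Copy)]
pub enum FullChannelBehavior {
    /// Await on `send`, slowing the producer down (backpressure),
    /// e.g. for transaction-bench.
    Block,
    /// Use `try_send` and drop the batch for that leader, e.g. for STS,
    /// which has a retry thread anyway.
    DropWhenFull,
}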
@@ -35,7 +35,7 @@ pub trait LeaderUpdater: Send {
     /// If the current leader estimation is incorrect and transactions are sent to
     /// only one estimated leader, there is a risk of losing all the transactions,
     /// depending on the forwarding policy.
-    fn next_leaders(&self, lookahead_slots: u64) -> Vec<SocketAddr>;
+    fn next_leaders(&mut self, lookahead_slots: u64) -> Vec<SocketAddr>;
why this change?
Because to implement this trait in STS I need to be able to modify self.
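For illustration, a simplified sketch of the trait change (trimmed to the one method in the diff) and a hypothetical STS-side implementor that needs to mutate its own state:

use std::net::SocketAddr;

pub trait LeaderUpdater: Send {
    fn next_leaders(&mut self, lookahead_slots: u64) -> Vec<SocketAddr>;
}

// Hypothetical implementor: it updates internal state while answering,
// which a `&self` receiver would not allow without interior mutability.
struct CachedLeaderUpdater {
    cached: Vec<SocketAddr>,
    calls: u64,
}

impl LeaderUpdater for CachedLeaderUpdater {
    fn next_leaders(&mut self, lookahead_slots: u64) -> Vec<SocketAddr> {
        self.calls += 1; // the mutation that motivates `&mut self`
        self.cached
            .iter()
            .take(lookahead_slots as usize)
            .copied()
            .collect()
    }
}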
Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.
Force-pushed from 263d796 to c0fd36e.
@alessandrod all addressed except for 0rtt (next time)
// Define the non-atomic struct and the `to_non_atomic` conversion method
define_non_atomic_struct!(
    SendTransactionStatsNonAtomic,
    SendTransactionStats,
    {
minor: I would call the macro define_non_atomic_version_for!, just to be extra clear that it creates a struct based on another struct.
Suggested change:
-// Define the non-atomic struct and the `to_non_atomic` conversion method
-define_non_atomic_struct!(
-    SendTransactionStatsNonAtomic,
-    SendTransactionStats,
-    {
+// Define the non-atomic struct and the `to_non_atomic` conversion method
+define_non_atomic_version_for!(
+    SendTransactionStats,
+    SendTransactionStatsNonAtomic,
+    {
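For context, a hypothetical sketch of what such a macro could look like, using the argument order from the suggestion (the PR's actual macro and field names may differ; the usage below is illustrative only):

macro_rules! define_non_atomic_version_for {
    ($atomic:ident, $non_atomic:ident, { $($field:ident),* $(,)? }) => {
        #[derive(Debug, Default, Clone, PartialEq)]
        pub struct $non_atomic {
            $(pub $field: u64,)*
        }

        impl $atomic {
            pub fn to_non_atomic(&self) -> $non_atomic {
                $non_atomic {
                    $($field: self.$field.load(std::sync::atomic::Ordering::Relaxed),)*
                }
            }
        }
    };
}

// Hypothetical usage (field names made up for illustration):
// define_non_atomic_version_for!(SendTransactionStats, SendTransactionStatsNonAtomic, {
//     successfully_sent, connection_error,
// });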
tpu-client-next/src/workers_cache.rs
Outdated
mpsc::error::TrySendError::Full(_) => WorkersCacheError::FullChannel,
mpsc::error::TrySendError::Closed(_) => WorkersCacheError::ReceiverDropped,
style: Consider importing mpsc::error::TrySendError.
Suggested change:
-            mpsc::error::TrySendError::Full(_) => WorkersCacheError::FullChannel,
-            mpsc::error::TrySendError::Closed(_) => WorkersCacheError::ReceiverDropped,
+use {
+    /* ... */
+    tokio::{sync::mpsc::{self, error::TrySendError}, task::JoinHandle},
+    /* ... */
+};
+/* ... */
+            TrySendError::Full(_) => WorkersCacheError::FullChannel,
+            TrySendError::Closed(_) => WorkersCacheError::ReceiverDropped,
}

/// Sends a batch of transactions to the worker for a given peer. If the
/// worker for the peer is disconnected or fails, it is removed from the
/// cache.
-    pub async fn send_transactions_to_address(
+    pub(crate) fn try_send_transactions_to_address(
minor: Unrelated to this PR, but the address suffix in the name is probably unnecessary. As the call will have the address as the first argument, it is probably readable enough if the function is called try_send_transactions() or try_send_transactions_to().
I will not touch this in this PR, to minimize the number of changes (and simplify the backport to 2.1).
tpu-client-next/src/workers_cache.rs
Outdated
if let Some(worker) = worker {
    tokio::spawn(async move {
        let leader = worker.leader();
style: You can reduce the whole body's indentation with a let/else:
Suggested change:
-        if let Some(worker) = worker {
-            tokio::spawn(async move {
-                let leader = worker.leader();
+        let Some(worker) = worker else {
+            return;
+        };
+        tokio::spawn(async move {
+            let leader = worker.leader();
tpu-client-next/src/workers_cache.rs
Outdated
pub(crate) fn maybe_shutdown_worker(worker: Option<ShutdownWorker>) {
    if let Some(worker) = worker {
        tokio::spawn(async move {
Spawning tasks without observing their execution is, I think, a bit fragile. The executor will keep track of them, but it is not clear in the rest of the code what is going on. And if those tasks take a long time to complete, or even hang, that would just show up as the executor taking a long time to finish, or hanging. I do not fully understand the end-to-end logic here, so it could be hard to track. But I suggest you consider putting all the shutdown tasks in a FuturesUnordered or something like that, allowing you to track and even interrupt the shutdown process. In particular, it would allow the shutdown errors to be processed in a centralized location, rather than just being printed as warnings. Though again, I'm not sure you really need this feature. But maybe for collection of the stats?
> But I suggest you consider putting all the shutdown tasks in a FuturesUnordered or something like that, allowing you to track and even interrupt the shutdown process.

We explicitly don't want to do this though: we don't want cleanup to delay sending transactions, so we do need tasks. We could have a JoinSet and call poll_join_next to poll without blocking and pop tasks off the join set, although that seems extra work for little gain. Background tasks don't block the runtime: https://docs.rs/tokio/latest/tokio/runtime/struct.Runtime.html#shutdown
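A small sketch of the JoinSet alternative mentioned here, not the PR's code, assuming a tokio version that provides JoinSet::try_join_next: shutdown work still runs in background tasks, but finished ones can be drained without blocking the send loop.

use tokio::task::JoinSet;

async fn shutdown_tracking_example() {
    let mut shutdowns: JoinSet<()> = JoinSet::new();

    // Spawn cleanup into the set instead of a detached tokio::spawn.
    shutdowns.spawn(async {
        // ... worker.shutdown().await, log errors, etc.
    });

    // In the hot loop: pop any completed shutdown tasks without awaiting.
    while let Some(res) = shutdowns.try_join_next() {
        if let Err(err) = res {
            eprintln!("shutdown task failed: {err}");
        }
    }
}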
@@ -433,10 +436,10 @@ async fn test_staked_connection() {

     // Wait for the exchange to finish.
     tx_sender_shutdown.await;
-    let localhost_stats = join_scheduler(scheduler_handle).await;
+    let localhost_stats = join_scheduler(scheduler_handle).await.to_non_atomic();
Would it make sense to put the .to_non_atomic() call inside join_scheduler()? Or do we not want it there for the non-test case?
You are right; doing this also allows us to avoid cloning.
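A hypothetical shape of the helper after moving the conversion inside, per the agreement above. The real join_scheduler and the SendTransactionStats / SendTransactionStatsNonAtomic types live in the crate; this only illustrates the idea.

use tokio::task::JoinHandle;

// Convert the atomic stats once, right after the scheduler task finishes,
// so callers never need to clone the atomics.
async fn join_scheduler(
    scheduler_handle: JoinHandle<SendTransactionStats>,
) -> SendTransactionStatsNonAtomic {
    scheduler_handle
        .await
        .expect("scheduler task should not panic")
        .to_non_atomic()
}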
Force-pushed from 1341c0c to 2dc802c.
Force-pushed from 3ee1853 to 448d476.
-// Regardless of who is leader, add future leaders to the cache to
-// hide the latency of opening the connection.
+// add future leaders to the cache to hide the latency of opening the
+// connection.
 for peer in future_leaders {
Here it establishes connections to future leaders.
Force-pushed from 448d476 to 958de6f.
 /// This enum defines to how many discovered leaders we will send transactions.
 pub enum LeadersFanout {
     /// Send transactions to all the leaders discovered by the `next_leaders`
     /// call.
     All,
     /// Send transactions to the first selected number of leaders discovered by
     /// the `next_leaders` call.
-    Next(usize),
+    Next(Fanout),
You don't like Next { send: usize, connect: usize }?
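For reference, a sketch of the two shapes being discussed; the doc comments and the exact Fanout definition are assumptions based on this thread, not the PR's final code.

pub struct Fanout {
    /// How many upcoming leaders to actually send each batch to.
    pub send: usize,
    /// How many upcoming leaders to warm up connections to (>= `send`).
    pub connect: usize,
}

pub enum LeadersFanout {
    /// Send to all leaders returned by `next_leaders`.
    All,
    /// Send to the first `send` leaders, pre-connect to the first `connect`.
    Next(Fanout),
}

// The alternative raised above would inline the fields instead:
// Next { send: usize, connect: usize }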
Co-authored-by: Illia Bobyr <[email protected]>
Force-pushed from 958de6f to 5fbfddd.
Approving with a nit. Feel free to fix the nit in one of the follow-ups if you don't want to do another CI run.
}

for new_leader in fanout_leaders {
    if !workers.contains(new_leader) {
nit: I don't think that this can ever happen? I'd remove the code
This can happen if a connection to the leader is dropped and the worker is stopped. This arm in the send_err match below handles it:

Err(WorkersCacheError::ReceiverDropped) => {
    // Remove the worker from the cache, if the peer has disconnected.
    maybe_shutdown_worker(workers.pop(*new_leader));
}

It is also possible for the fanout_leaders to contain duplicates; a duplicate would not be able to get a matching worker.
* Add tpu-client-next to the root Cargo.toml
* Change LeaderUpdater trait to accept mut self
* add fanout to the tpu-client-next
* Shutdown in separate task
* Use try_send instead, minor impromenets
* fix LeaderUpdaterError traits
* improve lifetimes in split_leaders
  Co-authored-by: Illia Bobyr <[email protected]>
* address PR comments
* create connections in advance
* removed lookahead_slots
---------
Co-authored-by: Illia Bobyr <[email protected]>
(cherry picked from commit 2a618b5)
# Conflicts:
#   Cargo.toml
/// * the second vector contains the leaders, used to warm up connections. This
///   slice includes the the first set.
minor
Suggested change:
-/// * the second vector contains the leaders, used to warm up connections. This
-///   slice includes the the first set.
+/// * the second slice contains the leaders, used to warm up connections. This
+///   slice includes the first set.
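A hedged sketch matching the doc comment above; the PR's real split_leaders takes the leaders and the fanout config and may differ in signature. The point is only that the second returned slice is a superset of the first.

use std::net::SocketAddr;

fn split_leaders(
    leaders: &[SocketAddr],
    send: usize,
    connect: usize,
) -> (&[SocketAddr], &[SocketAddr]) {
    let send = send.min(leaders.len());
    let connect = connect.min(leaders.len()).max(send);
    // First slice: leaders to send to; second: leaders to warm up
    // connections to, which includes the first set.
    (&leaders[..send], &leaders[..connect])
}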
let worker = Self::spawn_worker(
    &endpoint,
    peer,
    worker_channel_size,
    skip_check_transaction_age,
    max_reconnect_attempts,
    stats.clone(),
minor: This clone() should be unnecessary; I do not see stats used in this block anymore.
Suggested change:
-    stats.clone(),
+    stats,
weird that it passed clippy
let (fanout_leaders, connect_leaders) =
    split_leaders(&updated_leaders, &leaders_fanout);
It is a bit confusing that we call these the send and connect portions elsewhere, but here they are called the fanout and connection portions. Maybe it would be more consistent to call them send_leaders and connect_leaders here as well?
Suggested change:
-let (fanout_leaders, connect_leaders) =
-    split_leaders(&updated_leaders, &leaders_fanout);
+let (send_leaders, connect_leaders) =
+    split_leaders(&updated_leaders, &leaders_fanout);
for new_leader in fanout_leaders {
    if !workers.contains(new_leader) {
Similar to the fanout-to-send rename above, the new_leader name is a bit confusing to me here. I would expect the new_leader name to imply we are going to open a connection to this leader or start a worker for it, but we actually start workers in the block above. Maybe call it send_to instead? Or some other name that indicates that this is only a destination for the next transaction batch. It could as well be the same leader as in the previous slot group.[1]
Suggested change:
-for new_leader in fanout_leaders {
-    if !workers.contains(new_leader) {
+for send_to in send_leaders {
+    if !workers.contains(send_to) {
Footnotes:
[1] Is there a name for a sequence of NUM_CONSECUTIVE_LEADER_SLOTS slots?
@ilya-bobyr yeah, I'll do these renamings in the follow-up, and we also need to think about how to have backpressure for sending transactions. It is not obvious to me so far, but this backpressure is not needed for SendTransactionService, which is the priority for now, only for transaction-bench.
* add fanout to tpu-client-next (#3478)
  * Add tpu-client-next to the root Cargo.toml
  * Change LeaderUpdater trait to accept mut self
  * add fanout to the tpu-client-next
  * Shutdown in separate task
  * Use try_send instead, minor impromenets
  * fix LeaderUpdaterError traits
  * improve lifetimes in split_leaders
    Co-authored-by: Illia Bobyr <[email protected]>
  * address PR comments
  * create connections in advance
  * removed lookahead_slots
  ---------
  Co-authored-by: Illia Bobyr <[email protected]>
  (cherry picked from commit 2a618b5)
  # Conflicts:
  #   Cargo.toml
* resolve the conflict
---------
Co-authored-by: kirill lykov <[email protected]>
Problem
The tpu-client-next crate doesn't have fanout so far.
Summary of Changes
Add this feature.